Harry Potter - An Analysis on the Moods
- Introduction
- Relevant Packages
- Data Source
- Data Preparation
- A Word Cloud
- Term Frequency
- Term Frequency - Inverse Document Frequency (Unigrams)
- Bigrams
- Term Frequency - Inverse Document Frequency (Bigrams)
- Sentiment of Unigrams
- Sentiment of Bigrams
- Sentiment of Sentences
- Sentiment vs. Book Sales
- Sentiment vs. Box Office Worldwide
- Sentiment vs. Film Rating
- Sentiment vs. Rating (IMDb, Rotten Tomatoes)
- Topic Modeling
- Image Detection
- Takeaway & Looking Forward
Introduction
As the famous saying goes, “there are a thousand Hamlets in a thousand people’s eyes.” This project inspects that theory through the eyes of modern technology, asking what sentiment “reeks” from the text of the Harry Potter series by J.K. Rowling.
Questions to answer include, but are not limited to: Does sentiment analysis on single words tell the true story? How does sentiment change as a reader (in this case, RStudio and certain lexicon packages) progresses through the chapters of each book? If the sentiment is analyzed on a bigram basis, how does the overall result change? And what about analyzing on a sentence basis?
Relevant Packages
Packages used to facilitate this project include: rmdformats, wordcloud, devtools, tidyverse, stringr, tidytext, dplyr, reshape2, igraph, ggraph, ggplot2, sentimentr, rmarkdown, lexicon, textdata, textstem, widyr, readr, tibble, NMF, fmsb, stm, tm.
library(rmdformats)
library(wordcloud)
library(devtools)
library(tidyverse)
library(stringr)
library(tidytext)
library(dplyr)
library(reshape2)
library(igraph)
library(ggraph)
library(sentimentr)
library(rmarkdown)
library(lexicon)
library(textdata)
library(textstem)
library(widyr)
library(readr)
library(tibble)
library(NMF)
library(fmsb)
library(stm)
library(tm)
if (packageVersion("devtools") < 1.6) {
install.packages("devtools")
}
Data Source
Data for this project is leveraged from the harrypotter package found on GitHub. It contains the text of the seven novels in the Harry Potter series written by J.K. Rowling, which are:
1. philosophers_stone: Harry Potter and The Philosopher’s Stone (1997)
2. chamber_of_secrets: Harry Potter and The Chamber of Secrets (1998)
3. prisoner_of_azkaban: Harry Potter and The Prisoner of Azkaban (1999)
4. goblet_of_fire: Harry Potter and The Goblet of Fire (2000)
5. order_of_the_phoenix: Harry Potter and The Order of the Phoenix (2003)
6. half_blood_prince: Harry Potter and The Half Blood Prince (2005)
7. deathly_hallows: Harry Potter and The Deathly Hallows (2007)
devtools::install_github("bradleyboehmke/harrypotter")
library(harrypotter)
titles <- c("Philosopher's Stone", "Chamber of Secrets", "Prisoner of Azkaban",
"Goblet of Fire", "Order of the Phoenix", "Half-Blood Prince",
"Deathly Hallows")
books <- list(philosophers_stone, chamber_of_secrets, prisoner_of_azkaban,
goblet_of_fire, order_of_the_phoenix, half_blood_prince,
deathly_hallows)
Data Preparation
This data is pretty clean by nature, but since sentiment analysis is conducted, certain cleaning steps are taken to better prepare the dataset.
Expected Words with High Frequency
Words that are not necessarily considered normal stop words but will surely show high frequency throughout the text include, but are not limited to:
Main Characters: Harry Potter, Hermione Granger, Lord Voldemort, Draco Malfoy, Professor Severus Snape, Ron Weasley, Professor Albus Dumbledore, Dobby the House Elf, etc. – you name it!
Hogwarts Houses: Gryffindor, Hufflepuff, Ravenclaw, Slytherin
And some other trivial but not so trivial stuff…
# set up a new dataset in addition to the built-in "stop_words" data
characterhp <- tibble(word = c(as.character(1:63),
  "harry", "potter", "professor", "hermione", "granger", "rubeus", "hagrid",
  "dudley", "dursley", "draco", "malfoy", "severus", "snape", "lord",
  "voldemort", "albus", "dumbledore", "sirius", "ron", "neville", "dobby",
  "lupin", "mcgonagall", "newt", "scamander", "grindelwald", "tina", "queenie",
  "jacob", "harry's", "hogwarts", "weasley", "george", "ginny", "gryffindor",
  "umbridge", "ron's", "ravenclaw", "slytherin", "hufflepuff", "dumbledore's",
  "hermione's", "slughorn", "hagrid's", "kreacher", "dementors", "petunia",
  "snape's", "voldemort's", "dursleys", "malfoy's", "pomfrey", "lockhart's",
  "riddle's", "gilderoy", "lockhart", "myrtle", "kwikspell", "slughorn's",
  "quirrell's", "quirrell", "myrtle's", "umbridge's"))
The Stop Words
Of course, now a routine: get rid of the stop words in the text.
Order of the Books
Since later in this analysis we will look at how sentiments and moods evolve as we progress through the chapters, making sure the books are listed in order is rather important.
Term Frequency
Term frequency is a fairly important concept for spotting the terms that occur most often. Certain insights can be drawn from this result: if a word appears more often, it may have more leverage in shaping the final sentiment scores.
Overall
Start by getting a basic idea of the most frequently used words. “Looked”, “time”, and (undoubtedly) “wand” are among the most frequently used words across all seven books.
#count(hpTotal)
hpTotal <- hp %>%
group_by(word) %>%
summarise(count = n()) %>%
arrange(desc(count))
hpTotal$tfoverall <- hpTotal$count/23712
rmarkdown::paged_table(hpTotal)
By Book
Compute the term frequencies of words and show the top 10 words by book. “Time”, “looked”, and “eyes” show up in almost all of the books.
hp2 <- hp %>%
count(book, word, sort = TRUE)
hpTF <- hp %>%
# group_by(book, word) %>%
# summarise(count = n())
# select(book, word) %>%
split(., .$book) %>%
lapply(., function(x) {
textTokens = tm::MC_tokenizer(x$word)
total = length(textTokens)
})
temp <- unlist(hpTF)
hpTF <- data.frame(book = c("Deathly Hallows", "Half-Blood Prince", "Order of the Phoenix", "Goblet of Fire", "Prisoner of Azkaban", "Chamber of Secrets", "Philosopher's Stone"),
total = temp)
rownames(hpTF) <- NULL
hpTF <- hp2 %>%
left_join(., hpTF, by = "book")
hpTF$term_frequency <- hpTF$n/hpTF$total
hpTFtop <- hpTF %>%
group_by(book) %>%
arrange(desc(term_frequency)) %>%
slice(1:10)
rmarkdown::paged_table(hpTFtop)
Term Frequency - Inverse Document Frequency (Unigrams)
tf-idf shows how important a word is to a document. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.
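Concretely, the quantities involved (tidytext’s bind_tf_idf() uses the natural logarithm) can be written as:

```latex
\mathrm{tf}(t,d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}}, \qquad
\mathrm{idf}(t) = \ln \frac{N}{\lvert \{ d \in D : t \in d \} \rvert}, \qquad
\text{tf-idf}(t,d) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t)
```

where $n_{t,d}$ is the count of term $t$ in book $d$ and $N = 7$ is the number of books. A word that appears in all seven books gets $\mathrm{idf} = \ln(7/7) = 0$, so its tf-idf is zero no matter how frequent it is; that is why ubiquitous words like “looked” drop out of the tf-idf rankings.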
By Book
Compute the tf-idf table by book. The words deemed as “important” are more context-related.
hpTFIDF <- hpTF %>%
tidytext::bind_tf_idf(word, book, n) %>%
arrange(desc(tf_idf)) %>%
select(-term_frequency)
hpTFIDFtop <- hpTFIDF %>%
group_by(book) %>%
arrange(desc(tf_idf)) %>%
slice(1:10)
rmarkdown::paged_table(hpTFIDFtop)
Visualization
hpTFIDFtop %>%
ggplot(aes(x = reorder(word, tf_idf), y = tf_idf)) +
geom_col(aes(fill = book), position="identity", alpha=0.5, show.legend = FALSE) +
facet_wrap(~book, ncol = 3, scales = "free_y") +
labs(title = "Words with High tf-idf by Book", x = "Words", y = "tf-idf") +
coord_flip() +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Bigrams
Words should also be viewed in a collective sense - in the literary world, many words are meaningless when analyzed individually. Thus, it might provide more in-depth insight to see which word sets are used most frequently. Investigating bigrams is a good start.
Most Frequently Used
The most frequently used bigrams are mostly context-related.
hp2 <- tibble()
for(i in seq_along(titles)) {
temp <- tibble(chapter = seq_along(books[[i]]),
text = books[[i]]) %>%
mutate(book = titles[i])
hp2 <- rbind(hp2, temp)
}
hpBigram <- hp2 %>%
select(book, text) %>%
mutate(text = tolower(text)) %>%
mutate(text = lemmatize_strings(text)) %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2)
hpBigram2 <- hpBigram %>%
separate(bigram, c("word1", "word2"), sep = " ")
hpBigram3 <- hpBigram2 %>%
filter(!(word1 %in% stop_words$word)) %>%
filter(!(word2 %in% stop_words$word)) %>%
filter(!(word1 %in% characterhp$word)) %>%
filter(!(word2 %in% characterhp$word))
hpBigramFinal <- hpBigram3 %>%
unite(bigram, word1, word2, sep = " ") %>%
count(bigram, sort = TRUE)
rmarkdown::paged_table(hpBigramFinal)
Term Frequency - Inverse Document Frequency (Bigrams)
A Table
From the tf-idf table below, we can see the top 10 most “important” bigrams by book. “Death eater” seems to be fairly important to four of the seven books in the series, which echoes its top presence when the text of all seven books is analyzed as a whole.
hp3 <- tibble()
for(i in seq_along(titles)) {
temp <- tibble(chapter = seq_along(books[[i]]),
text = books[[i]]) %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
mutate(book = titles[i]) %>%
select(book, everything())
hp3 <- rbind(hp3, temp)
}
hp3$book <- factor(hp3$book, levels = rev(titles))
hp3 %>%
count(bigram, sort = TRUE)
## # A tibble: 340,021 x 2
## bigram n
## <chr> <int>
## 1 of the 4895
## 2 in the 3571
## 3 said harry 2626
## 4 he was 2490
## 5 at the 2435
## 6 to the 2386
## 7 on the 2359
## 8 he had 2138
## 9 it was 2123
## 10 out of 1911
## # … with 340,011 more rows
bigrams_separated <- hp3 %>%
separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered <- bigrams_separated %>%
filter(!(word1 %in% stop_words$word)) %>%
filter(!(word2 %in% stop_words$word)) %>%
filter(!(word1 %in% characterhp$word)) %>%
filter(!(word2 %in% characterhp$word))
bigrams_united <- bigrams_filtered %>%
unite(bigram, word1, word2, sep = " ")
bigrams_united %>%
count(bigram, sort = TRUE)
## # A tibble: 70,371 x 2
## bigram n
## <chr> <int>
## 1 uncle vernon 386
## 2 death eaters 346
## 3 invisibility cloak 192
## 4 dark arts 176
## 5 death eater 164
## 6 entrance hall 145
## 7 daily prophet 125
## 8 mad eye 116
## 9 hospital wing 107
## 10 madame maxime 94
## # … with 70,361 more rows
hp_bigram_tfidf <- bigrams_united %>%
count(book, bigram) %>%
bind_tf_idf(bigram, book, n) %>%
arrange(desc(tf_idf))
hp_bigram_tfidf_top <- hp_bigram_tfidf %>%
group_by(book) %>%
arrange(desc(tf_idf)) %>%
slice(1:10)
rmarkdown::paged_table(hp_bigram_tfidf_top)
Visualization
A visualization of the bigrams with the top tf-idf scores, broken down by book in color.
tfidf_plot <- hp_bigram_tfidf %>%
arrange(desc(tf_idf)) %>%
mutate(bigram = factor(bigram, levels = rev(unique(bigram))))
tfidf_plot %>%
top_n(30) %>%
ggplot(aes(bigram, tf_idf, fill = book)) +
geom_col() +
labs(x = NULL, y = "tf-idf", title = "Bigrams with Top tf-idf's", subtitle = "by book") +
coord_flip() +
scale_fill_manual(labels = c("Deathly Hallows", "Half-Blood Prince", "Order of the Phoenix", "Goblet of Fire", "Prisoner of Azkaban", "Chamber of Secrets", "Philosopher's Stone"), values = c("skyblue", "salmon1","plum", "red3", "palegoldenrod", "yellow", "peachpuff3"))
Sentiment of Unigrams
Three lexicons are included in the sentiments dataset of the tidytext package: AFINN, bing, and nrc. All three are based on unigrams, and these single words are assigned sentiment scores according to each lexicon’s categorization. We go through each of the three lexicons in this section.
nrc Lexicon
The nrc lexicon classifies words in a binary fashion (0 or 1, i.e., no or yes) into ten categories: positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.
According to the nrc lexicon, most words are categorized as “negative”, while the runner-up is “positive”, with a difference of roughly 5,500 words. Then come “fear” and “trust”, about 2,000 words further behind. In general, there is a stronger negative presence than positive.
nrcfirst <- hp %>%
right_join(get_sentiments("nrc")) %>%
filter(!is.na(sentiment)) %>%
count(sentiment, sort = TRUE)
rmarkdown::paged_table(nrcfirst)
AFINN Lexicon
The AFINN lexicon maps words with a score ranging from -5 to 5. A negative score indicates negative sentiment and a positive score indicates positive sentiment.
From this result, we can see that the words carrying the highest absolute sentiment values are mostly negative.
sentiafinn <- hp %>%
group_by(word) %>%
# mutate(word_count = 1:n(),
# index = word_count %/% 500 + 1) %>%
inner_join(get_sentiments("afinn")) %>%
summarise(sentiment = sum(value)) %>%
arrange(desc(abs(sentiment)))
rmarkdown::paged_table(sentiafinn)
bing Lexicon
The bing lexicon classifies words in a binary fashion as either positive or negative.
Overall
We can see that according to “bing”, around 68% of all the matched words are categorized as negative.
hp %>%
right_join(get_sentiments("bing")) %>%
filter(!is.na(sentiment)) %>%
count(sentiment, sort = TRUE)
## Joining, by = "word"
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 negative 38973
## 2 positive 17938
hp %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("plum3", "rosybrown3"),
max.words = 60)
## Joining, by = "word"
By Book
According to “bing”, we generate how the net sentiment changes as readers progress through each book. For a better output, each book is broken up into chunks of 500 words (the approximate number of words on every two pages), which are stored in index accordingly.
The sentiment shows plenty of variation in six of the books, the exception being Order of the Phoenix - which shows more negativity throughout the whole book.
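The bucketing arithmetic `word_count %/% 500 + 1` used below can be sanity-checked in base R; note that word 500 already lands in the second bucket:

```r
# Map running word positions to 500-word buckets via integer division
positions <- c(1, 250, 499, 500, 501, 1000, 1001)
index <- positions %/% 500 + 1
index
# 1 1 1 2 2 3 3
```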
hp %>%
group_by(book) %>%
mutate(word_count = 1:n(),
index = word_count %/% 500 + 1) %>%
inner_join(get_sentiments("bing")) %>%
count(book, index = index , sentiment) %>%
ungroup() %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative,
book = factor(book, levels = titles)) %>%
ggplot(aes(index, sentiment, fill = book)) +
geom_bar(alpha = 0.5, stat = "identity", show.legend = FALSE) +
facet_wrap(~ book, ncol = 2, scales = "free_x") +
labs(title = "Sentiment Score vs. Reading Progress", subtitle = "based on bing lexicon, by book, grouped by each 500 words", x = "(#words/500)", y = "sentiment score")
Comparison
To see how the AFINN, bing, and nrc lexicons lead to different sentiment analysis results, it is helpful to benchmark them against each other. For a better output, each book is again broken up into chunks of 500 words (the approximate number of words on every two pages), stored in index accordingly. The result shows how these lexicons shape the sentiment scores of each book, and lets us investigate questions like: is there any book that receives similar sentiment scores from all three lexicons?
From the result, we can see that nrc produces rather different sentiment scores, while AFINN and bing show a more similar pattern in categorizing words. However, how much coincidence is factored into this result needs further investigation.
sentiafinn <- hp %>%
group_by(book) %>%
mutate(word_count = 1:n(),
index = word_count %/% 500 + 1) %>%
inner_join(get_sentiments("afinn")) %>%
group_by(book, index) %>%
summarise(sentiment = sum(value)) %>%
mutate(method = "AFINN")
#sentibing <- hp %>%
# group_by(book) %>%
# mutate(word_count = 1:n(),
# index = word_count %/% 500 + 1) %>%
# inner_join(get_sentiments("bing")) %>%
# mutate(method = "Bing")
#sentinrc <- hp %>%
# group_by(book) %>%
# mutate(word_count = 1:n(),
# index = word_count %/% 500 + 1) %>%
# inner_join(get_sentiments("nrc") %>%
# filter(sentiment %in% c("positive", "negative"))) %>%
# mutate(method = "NRC")
sentibingnrc <- bind_rows(hp %>%
group_by(book) %>%
mutate(word_count = 1:n(),
index = word_count %/% 500 + 1) %>%
inner_join(get_sentiments("bing")) %>%
mutate(method = "bing"),
hp %>%
group_by(book) %>%
mutate(word_count = 1:n(),
index = word_count %/% 500 + 1) %>%
inner_join(get_sentiments("nrc") %>%
filter(sentiment %in% c("positive", "negative"))) %>%
mutate(method = "nrc")) %>%
count(book, method, index = index , sentiment) %>%
ungroup() %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative) %>%
select(book, index, method, sentiment)
bind_rows(sentiafinn,
sentibingnrc) %>%
ungroup() %>%
mutate(book = factor(book, levels = titles)) %>%
ggplot(aes(index, sentiment, fill = method)) +
geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
facet_grid(book ~ method) +
scale_fill_manual(labels = letters[1:3], values = c("coral3", "darkgoldenrod1","aquamarine4"))+
labs(title = "Sentiment Score by Different Lexicons", subtitle = "by book, grouped by every 500 words", x = "(#words/500)", y = "sentiment score")
Sentiment of Bigrams
Sentiment analysis might be misleading when a negation word appears right before a word that is mapped to a sentiment/emotion lexicon entry. Some common negation words include, but are not limited to: “no”, “not”, “doesn’t”, “isn’t”, “wasn’t”, “can’t”, “hardly”, “barely”, “won’t”, “don’t”, “shouldn’t”, “wouldn’t”, and “couldn’t”. Thus, the following analysis aims to see whether words scored earlier in the sentiment analysis might actually push the outcome in the opposite direction.
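To illustrate the concern, here is a minimal base-R sketch, with a tiny made-up lexicon (not one of the real lexicons used in this project), of how a negator preceding a scored word should flip that word’s contribution:

```r
# Toy lexicon: word -> sentiment score (hypothetical values, illustration only)
lexicon <- c(happy = 3, terrible = -3, magic = 1)
negators <- c("no", "not", "never", "hardly", "barely")

# Score a bigram: flip the second word's score when the first word negates it
score_bigram <- function(word1, word2) {
  s <- unname(lexicon[word2])
  if (is.na(s)) return(0)          # second word not in the lexicon
  if (word1 %in% negators) -s else s
}

score_bigram("not", "happy")       # -3: negation flips the positive score
score_bigram("the", "terrible")    # -3: no negator, score kept as-is
```

A unigram analysis would score the “happy” in “not happy” as positive, which is exactly the kind of miscount this section probes.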
A Table
The table shows the most used bigrams that start with a negation word.
bigrams_separated_negation <- bigrams_separated %>%
filter(word1 == "no" | word1 == "not" | word1 == "doesn't" | word1 == "isn't" | word1 == "wasn't" | word1 == "can't" | word1 == "hardly" | word1 == "barely" | word1 == "won't" | word1 == "don't" | word1 == "shouldn't" | word1 == "wouldn't" | word1 == "couldn't") %>%
filter(!word2 %in% stop_words$word) %>%
count(word1, word2, sort = TRUE)
rmarkdown::paged_table(bigrams_separated_negation)
The (Probably) Falsely Categorized Sentiment
Now, incorporate the bing lexicon again. Just from scanning the first few pages of the table, we can see that the most frequent second words following a negation word are categorized as negative. This might be one reason the negative sentiment of single words so greatly outweighs the positive.
bingtemp <- get_sentiments("bing")
bigrams_separated_negation2 <- bigrams_separated_negation %>%
filter(word1 == "no" | word1 == "not" | word1 == "doesn't" | word1 == "isn't" | word1 == "wasn't" | word1 == "can't" | word1 == "hardly" | word1 == "barely" | word1 == "won't" | word1 == "don't" | word1 == "shouldn't" | word1 == "wouldn't" | word1 == "couldn't") %>%
filter(!word2 %in% stop_words$word) %>%
inner_join(bingtemp, by = c(word2 = "word")) %>%
ungroup()
rmarkdown::paged_table(bigrams_separated_negation2)
Sentiment of Sentences
Though tokenizing at the single-word level is useful in some scenarios, conducting sentiment analysis on larger units of text might be more applicable in the literature world. The sentimentr package provides algorithms that grant sentiment analysis at the sentence level. The following results show, for each book, the net sentiment by chapter and then by sentence. The analysis is based on AFINN, since its sentiment scores fall within a well-defined range.
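For reference, sentimentr’s sentence-level scorer weighs valence shifters such as negators instead of scoring words in isolation. A minimal sketch (exact scores depend on the package version’s polarity table, so none are shown):

```r
library(sentimentr)

# sentiment() returns one row per sentence with a signed sentiment score,
# adjusting for valence shifters such as "not" rather than scoring words alone
sentiment(c("Harry was not afraid.",
            "The dementors drained every happy thought."))
```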
Harry Potter and The Philosopher’s Stone
first_sentence <- tibble(chapter = 1:length(philosophers_stone),
text = philosophers_stone) %>%
unnest_tokens(sentence, text, token = "sentences")
first_sentence <- first_sentence %>%
group_by(chapter) %>%
mutate(sentence_num = 1:n(),
index = round(sentence_num / n(), 2)) %>% # to track sentence
unnest_tokens(word, sentence) %>%
inner_join(get_sentiments("afinn")) %>%
group_by(chapter, index) %>%
summarise(sentiment = sum(value, na.rm = TRUE)) %>%
arrange(desc(sentiment))
ggplot(first_sentence, aes(index, factor(chapter, levels = sort(unique(chapter), decreasing = TRUE)), fill = sentiment)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "cadetblue4", mid = "white",
high = "darkseagreen", midpoint = 0, space = "Lab", guide = "colourbar", aesthetics = "fill") +
scale_x_continuous(labels = scales::percent, expand = c(0, 0)) +
scale_y_discrete(expand = c(0, 0)) +
labs(x = "Progress", y = "Chapter") +
ggtitle("Sentiment of Harry Potter and The Philosopher's Stone",
subtitle = "net sentiment scores, based on AFINN Lexicon") +
theme_minimal() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "top")
Harry Potter and The Chamber of Secrets
second_sentence <- tibble(chapter = 1:length(chamber_of_secrets),
text = chamber_of_secrets) %>%
unnest_tokens(sentence, text, token = "sentences")
second_sentence <- second_sentence %>%
group_by(chapter) %>%
mutate(sentence_num = 1:n(),
index = round(sentence_num / n(), 2)) %>%
unnest_tokens(word, sentence) %>%
inner_join(get_sentiments("afinn")) %>%
group_by(chapter, index) %>%
summarise(sentiment = sum(value, na.rm = TRUE)) %>%
arrange(desc(sentiment))
ggplot(second_sentence, aes(index, factor(chapter, levels = sort(unique(chapter), decreasing = TRUE)), fill = sentiment)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "cadetblue4", mid = "white",
high = "darkseagreen", midpoint = 0, space = "Lab", guide = "colourbar", aesthetics = "fill") +
scale_x_continuous(labels = scales::percent, expand = c(0, 0)) +
scale_y_discrete(expand = c(0, 0)) +
labs(x = "Progress", y = "Chapter") +
ggtitle("Sentiment of Harry Potter and The Chamber of Secrets",
subtitle = "net sentiment scores, based on AFINN Lexicon") +
theme_minimal() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "top")
Harry Potter and The Prisoner of Azkaban
third_sentence <- tibble(chapter = 1:length(prisoner_of_azkaban),
text = prisoner_of_azkaban) %>%
unnest_tokens(sentence, text, token = "sentences")
third_sentence <- third_sentence %>%
group_by(chapter) %>%
mutate(sentence_num = 1:n(),
index = round(sentence_num / n(), 2)) %>%
unnest_tokens(word, sentence) %>%
inner_join(get_sentiments("afinn")) %>%
group_by(chapter, index) %>%
summarise(sentiment = sum(value, na.rm = TRUE)) %>%
arrange(desc(sentiment))
ggplot(third_sentence, aes(index, factor(chapter, levels = sort(unique(chapter), decreasing = TRUE)), fill = sentiment)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "cadetblue4", mid = "white",
high = "darkseagreen", midpoint = 0, space = "Lab", guide = "colourbar", aesthetics = "fill") +
scale_x_continuous(labels = scales::percent, expand = c(0, 0)) +
scale_y_discrete(expand = c(0, 0)) +
labs(x = "Progress", y = "Chapter") +
ggtitle("Sentiment of Harry Potter and The Prisoner of Azkaban",
subtitle = "net sentiment scores, based on AFINN Lexicon") +
theme_minimal() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "top")
Harry Potter and The Goblet of Fire
fourth_sentence <- tibble(chapter = 1:length(goblet_of_fire),
text = goblet_of_fire) %>%
unnest_tokens(sentence, text, token = "sentences")
fourth_sentence <- fourth_sentence %>%
group_by(chapter) %>%
mutate(sentence_num = 1:n(),
index = round(sentence_num / n(), 2)) %>%
unnest_tokens(word, sentence) %>%
inner_join(get_sentiments("afinn")) %>%
group_by(chapter, index) %>%
summarise(sentiment = sum(value, na.rm = TRUE)) %>%
arrange(desc(sentiment))
ggplot(fourth_sentence, aes(index, factor(chapter, levels = sort(unique(chapter), decreasing = TRUE)), fill = sentiment)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "cadetblue4", mid = "white",
high = "darkseagreen", midpoint = 0, space = "Lab", guide = "colourbar", aesthetics = "fill") +
scale_x_continuous(labels = scales::percent, expand = c(0, 0)) +
scale_y_discrete(expand = c(0, 0)) +
labs(x = "Progress", y = "Chapter") +
ggtitle("Sentiment of Harry Potter and The Goblet of Fire",
subtitle = "net sentiment scores, based on AFINN Lexicon") +
theme_minimal() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "top")
Harry Potter and The Order of the Phoenix
fifth_sentence <- tibble(chapter = 1:length(order_of_the_phoenix),
text = order_of_the_phoenix) %>%
unnest_tokens(sentence, text, token = "sentences")
fifth_sentence <- fifth_sentence %>%
group_by(chapter) %>%
mutate(sentence_num = 1:n(),
index = round(sentence_num / n(), 2)) %>%
unnest_tokens(word, sentence) %>%
inner_join(get_sentiments("afinn")) %>%
group_by(chapter, index) %>%
summarise(sentiment = sum(value, na.rm = TRUE)) %>%
arrange(desc(sentiment))
ggplot(fifth_sentence, aes(index, factor(chapter, levels = sort(unique(chapter), decreasing = TRUE)), fill = sentiment)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "cadetblue4", mid = "white",
high = "darkseagreen", midpoint = 0, space = "Lab", guide = "colourbar", aesthetics = "fill") +
scale_x_continuous(labels = scales::percent, expand = c(0, 0)) +
scale_y_discrete(expand = c(0, 0)) +
labs(x = "Progress", y = "Chapter") +
ggtitle("Sentiment of Harry Potter and The Order of the Phoenix",
subtitle = "net sentiment scores, based on AFINN Lexicon") +
theme_minimal() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "top")
Harry Potter and The Half Blood Prince
sixth_sentence <- tibble(chapter = 1:length(half_blood_prince),
text = half_blood_prince) %>%
unnest_tokens(sentence, text, token = "sentences")
sixth_sentence <- sixth_sentence %>%
group_by(chapter) %>%
mutate(sentence_num = 1:n(),
index = round(sentence_num / n(), 2)) %>%
unnest_tokens(word, sentence) %>%
inner_join(get_sentiments("afinn")) %>%
group_by(chapter, index) %>%
summarise(sentiment = sum(value, na.rm = TRUE)) %>%
arrange(desc(sentiment))
ggplot(sixth_sentence, aes(index, factor(chapter, levels = sort(unique(chapter), decreasing = TRUE)), fill = sentiment)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "cadetblue4", mid = "white",
high = "darkseagreen", midpoint = 0, space = "Lab", guide = "colourbar", aesthetics = "fill") +
scale_x_continuous(labels = scales::percent, expand = c(0, 0)) +
scale_y_discrete(expand = c(0, 0)) +
labs(x = "Progress", y = "Chapter") +
ggtitle("Sentiment of Harry Potter and The Half Blood Prince",
subtitle = "net sentiment scores, based on AFINN Lexicon") +
theme_minimal() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "top")
Harry Potter and The Deathly Hallows
seven_sentence <- tibble(chapter = 1:length(deathly_hallows),
text = deathly_hallows) %>%
unnest_tokens(sentence, text, token = "sentences")
seven_sentence <- seven_sentence %>%
group_by(chapter) %>%
mutate(sentence_num = 1:n(),
index = round(sentence_num / n(), 2)) %>%
unnest_tokens(word, sentence) %>%
inner_join(get_sentiments("afinn")) %>%
group_by(chapter, index) %>%
summarise(sentiment = sum(value, na.rm = TRUE)) %>%
arrange(desc(sentiment))
ggplot(seven_sentence, aes(index, factor(chapter, levels = sort(unique(chapter), decreasing = TRUE)), fill = sentiment)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "cadetblue4", mid = "white",
high = "darkseagreen", midpoint = 0, space = "Lab", guide = "colourbar", aesthetics = "fill") +
scale_x_continuous(labels = scales::percent, expand = c(0, 0)) +
scale_y_discrete(expand = c(0, 0)) +
labs(x = "Progress", y = "Chapter") +
ggtitle("Sentiment of Harry Potter and The Deathly Hallows",
subtitle = "net sentiment scores, based on AFINN Lexicon") +
theme_minimal() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "top")
Sentiment vs. Book Sales
U.S. First Print Run
Sales from the first print run have a lot to do with the series’ burgeoning popularity over time. Thus, it is fairer to compare first-print-run sales among the first three books, and then among the last four. As the sentiment does not really differ among the first three, we cannot really see a pattern or association there. Among Order of the Phoenix, Half-Blood Prince, and Goblet of Fire, sentiment seems to have a certain effect on sales: the higher (more positive) the sentiment score, the higher the sales.
Another way of looking at first-print-run sales might be: does the sentiment score of the previous book have an impact on the sales of the following book? For example, it might be due to the positivity shown in Half-Blood Prince that the sales of Deathly Hallows are relatively high. But this might also be because Deathly Hallows is the highly anticipated last book of the series.
sentiafinn3 <- hp %>%
group_by(book, word) %>%
inner_join(get_sentiments("afinn")) %>%
summarise(sentiment = sum(value)) %>%
group_by(book) %>%
summarise(sentiment = mean(sentiment))
hpbooksales <- read.csv("/Users/katliyx/Documents/SPRING2020/ITAO70250-Unstructured/Homework/HW3/HPSales.csv")
hpbooksalessenti <- hpbooksales %>%
left_join(sentiafinn3)
hpbooksalessenti[1,5] = "-4.427644"
hpbooksalessenti[2,5] = "-2.202925"
hpbooksalessenti[4,5] = "-2.950355"
hpbooksalessenti$sentiment <- as.numeric(hpbooksalessenti$sentiment)
hpbooksalessenti %>%
ggplot(aes(x = book, y = us_firstrun, fill = sentiment)) +
geom_bar(stat="identity") +
scale_fill_gradient2(low = "deepskyblue3", high = "aliceblue", midpoint = -0.3, aesthetics = "fill") +
coord_flip() +
theme_classic() +
labs(title = "Sentiment vs. U.S. First Print Run", subtitle = "sentiment analysis based on AFINN lexicon", y = "Sales", x = "Book")
UK First Print Run
In the UK, it might be that Goblet of Fire shows more positivity, so the sales of Order of the Phoenix are relatively higher. Again, the logic of this explanation might be a bit far-fetched, since in real life many things influence the sales of a given book in a series.
hpbooksalessenti %>%
ggplot(aes(x = book, y = uk_firstrun, fill = sentiment)) +
geom_bar(stat="identity") +
scale_fill_gradient2(low = "deepskyblue3", high = "aliceblue", midpoint = -0.3, aesthetics = "fill") +
coord_flip() +
theme_classic() +
labs(title = "Sentiment vs. UK First Print Run", subtitle = "sentiment analysis based on AFINN lexicon", y = "Sales", x = "Book")
Total Copies Sold Worldwide
Viewed from the worldwide-sales perspective, the sentiment of a book (or of the preceding book) does not necessarily have anything to do with its sales.
hpbooksalessenti %>%
ggplot(aes(x = book, y = worldwide_sales, fill = sentiment)) +
geom_bar(stat="identity") +
scale_fill_gradient2(low = "deepskyblue3", high = "aliceblue", midpoint = -0.3, aesthetics = "fill") +
coord_flip() +
theme_classic() +
labs(title = "Sentiment vs. Worldwide Sales", subtitle = "sentiment analysis based on AFINN lexicon", y = "Sales", x = "Book")
Sentiment vs. Box Office Worldwide
Note that there is one movie corresponding to each book, except that Harry Potter and the Deathly Hallows was split into Part 1 and Part 2. Thus, for this section, the box office corresponding to the last book is calculated as the average of Part 1 (960.4 million USD) and Part 2 (1.342 billion USD), which is 1.1512 billion USD.
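As a quick arithmetic check of that average, in base R:

```r
# Deathly Hallows box office: mean of Part 1 and Part 2 (millions of USD)
mean(c(960.4, 1342))
# 1151.2
```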
The result (sort of) shows a relationship: the higher (more positive) the average AFINN sentiment of a book, despite every book’s average sentiment actually being negative, the higher the box office. However, other factors need to be taken into account; for instance, the first and last movies are likely to draw a higher box office.
sentiafinn2 <- hp %>%
group_by(book, word) %>%
# mutate(word_count = 1:n(),
# index = word_count %/% 500 + 1) %>%
inner_join(get_sentiments("afinn")) %>%
summarise(sentiment = sum(value)) %>%
group_by(book) %>%
summarise(sentiment = mean(sentiment))
hpboxoffice <- read.csv("/Users/katliyx/Documents/SPRING2020/ITAO70250-Unstructured/Homework/HW3/HPBoxOffice.csv")
names(hpboxoffice)[names(hpboxoffice)=="movie"] <- "book"
hpboxofficesenti <- hpboxoffice %>%
right_join(sentiafinn2)
hpboxofficesenti[1,2] = "1151200000"
hpboxofficesenti[2,2] = "934500000"
hpboxofficesenti[4,2] = "897100000"
hpboxofficesenti$box_office <- as.numeric(hpboxofficesenti$box_office)
hpboxofficesenti$sentiment <- as.numeric(hpboxofficesenti$sentiment )
hpboxofficesenti %>%
ggplot(aes(x = book, y = box_office, fill = sentiment)) +
geom_bar(stat="identity") +
scale_fill_gradient2(low = "bisque3", high = "bisque1", midpoint = -0.3, aesthetics = "fill") +
coord_flip() +
theme_classic() +
labs(title = "Sentiment vs. Box Office", subtitle = "based on AFINN lexicon", y = "Box Office (USD)", x = "Book/Movie")
Sentiment vs. Film Rating
Negativity in the book text might indicate the frequent presence of strong language, violence, drug abuse, nudity, etc. in the corresponding movie, which makes the film rating worth comparing against sentiment. Again, since both Part 1 and Part 2 of Deathly Hallows were given the same film rating, the two movies are combined into one row here.
Here, we can see that the sentiment of the book text is loosely associated with the film rating of the corresponding movie. But this result might simply reflect that the average sentiment scores barely differ across the seven books, and thus the eight movies. Also, a movie's lines and script are not necessarily faithful to the book; thus using the sentiment of the book text alone to explain the film ratings of the movies might not be an entirely sound approach.
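Since MPAA ratings have a natural order, one refinement (a sketch; the ratings vector here is hypothetical, and the level set assumes the standard MPAA scale) is to store film_rate as an ordered factor rather than a plain one, so comparisons between ratings stay meaningful:

```r
# Hypothetical ratings vector; levels follow the standard MPAA scale
rates <- c("PG", "PG-13", "PG", "PG-13")
film_rate <- factor(rates, levels = c("G", "PG", "PG-13", "R"), ordered = TRUE)
film_rate[2] > film_rate[1]  # TRUE: PG-13 is the stricter rating
```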
hpfilmrating <- read.csv("/Users/katliyx/Documents/SPRING2020/ITAO70250-Unstructured/Homework/HW3/HPFilmRating.csv")
names(hpfilmrating)[names(hpfilmrating)=="movie"] <- "book"
hpfrsenti <- hpfilmrating %>%
right_join(sentiafinn2)
hpfrsenti[1,2] = "PG-13"
hpfrsenti[2,2] = "PG-13"
hpfrsenti[4,2] = "PG-13"
hpfrsenti$film_rate <- as.factor(hpfrsenti$film_rate)
hpfrsenti <- hpfrsenti %>%
arrange(desc(sentiment))
names(hpfrsenti)[names(hpfrsenti)=="book"] <- "book/movie"
rmarkdown::paged_table(hpfrsenti)
Sentiment vs. Rating (IMDb, Rotten Tomatoes)
It might be interesting to see whether there is any relationship between the sentiment of the book text and the ratings for the corresponding movies - since more negativity might actually draw a larger audience (darker films have been something of a trend for a while). For the last book there are two movies: Part 1 got 7.7/10 on IMDb and 77% on Rotten Tomatoes; Part 2 got 8.1/10 on IMDb and 96% on Rotten Tomatoes. To map each book to "one" movie, the numbers used here are the averages of the ratings for Part 1 and Part 2.
From this plot, we can see that the online ratings for the movies show no clear correlation with the sentiment of the books they correspond to.
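The averaging, sketched in base R with the ratings quoted above (the Rotten Tomatoes percentage is rescaled to a 10-point scale so the three series share one axis):

```r
imdb_avg <- mean(c(7.7, 8.1))     # Deathly Hallows Parts 1 and 2 on IMDb
rt_avg   <- mean(c(77, 96)) / 10  # Rotten Tomatoes, rescaled to /10
avg_two  <- mean(c(imdb_avg, rt_avg))
c(imdb = imdb_avg, rotten_tomatoes = rt_avg, avg_two = avg_two)
```

These reproduce the 7.9, 8.65, and 8.275 values patched into the data frame below.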
hprating <- read.csv("/Users/katliyx/Documents/SPRING2020/ITAO70250-Unstructured/Homework/HW3/HPRating.csv")
names(hprating)[names(hprating)=="movie"] <- "book"
hpratingsenti <- hprating %>%
right_join(sentiafinn2)
hpratingsenti[1,2] = "7.9"
hpratingsenti[2,2] = "7.6"
hpratingsenti[4,2] = "7.7"
hpratingsenti[1,3] = "8.65"
hpratingsenti[2,3] = "8.3"
hpratingsenti[4,3] = "8.8"
hpratingsenti[1,4] = "8.275"
hpratingsenti[2,4] = "7.95"
hpratingsenti[4,4] = "8.25"
hpratingsenti[1,5] = "19-Nov-10"
hpratingsenti[2,5] = "15-Jul-09"
hpratingsenti[4,5] = "18-Nov-05"
hpratingsenti$imdb <- as.numeric(hpratingsenti$imdb)
hpratingsenti$rotten_tomatoes <- as.numeric(hpratingsenti$rotten_tomatoes)
hpratingsenti$avg_two <- as.numeric(hpratingsenti$avg_two)
hpratingsenti$sentiment <- as.numeric(hpratingsenti$sentiment)
hpratingsenti$release_date <- as.Date(hpratingsenti$release_date, "%d-%b-%y")  # %b matches abbreviated month names like "Nov"
ggplot(hpratingsenti)+
geom_line(aes(x = release_date, y = imdb), color = "azure3") +
geom_line(aes(x = release_date, y = rotten_tomatoes), color = "coral2") +
geom_line(aes(x = release_date, y = avg_two), color = "cadetblue3") +
geom_point(mapping = aes(release_date, imdb, color = sentiment)) +
geom_point(mapping = aes(release_date, rotten_tomatoes, color = sentiment)) +
geom_point(mapping = aes(release_date, avg_two, color = sentiment)) +
scale_color_gradient(low = "gold4", high = "yellow") +
theme_bw() +
labs(title = "Sentiment vs. Online Ratings for Corresponding HP Movie", x = "Release Date", y = "Ratings (out of 10)")
Topic Modeling
preparation work
hp7 <- hp2
hp7$id <- seq.int(nrow(hp7))
cleanhp <- hp7 %>%
mutate(text = as.character(text),
text = str_replace_all(text, "\n", " "),
text = str_replace_all(text, "(\\[.*?\\])", ""),
text = str_squish(text),
text = gsub("([a-z])([A-Z])", "\\1 \\2", text),
text = tolower(text),
text = removeWords(text, c("’", stopwords(kind = "en"))),
text = removePunctuation(text),
text = removeNumbers(text),
text = textstem::lemmatize_strings(text),
doc_id = id,
book = book) %>%
select(doc_id, text, book) %>%
# anti_join(characterhp) %>%
as.data.frame()
hpcorpus = Corpus(DataframeSource(cleanhp))
preparation work (cont.)
hpTDM = TermDocumentMatrix(hpcorpus, control = list(weighting =
function(x)
weightTfIdf(x, normalize =
FALSE)))
inspect(hpTDM)
## <<TermDocumentMatrix (terms: 19196, documents: 200)>>
## Non-/sparse entries: 209841/3629359
## Sparsity : 95%
## Maximal term length: 49
## Weighting : term frequency - inverse document frequency (tf-idf)
## Sample :
## Docs
## Terms 107 120 125 126 171 186
## — 0.000000 0.000000 0.0000000 0.000000 0.000000 0.0000000
## ¨c 0.000000 0.000000 0.0000000 0.000000 92.507307 129.0233497
## dobby 0.000000 0.000000 0.0000000 0.000000 0.000000 32.2759186
## hagrid 1.449986 16.674838 83.3741895 20.299803 2.899972 0.0000000
## lupin 1.415037 0.000000 0.0000000 2.830075 5.660150 0.0000000
## sirius 2.028999 6.086997 3.0434987 2.028999 0.000000 0.0000000
## slughorn 0.000000 0.000000 0.0000000 0.000000 0.000000 0.0000000
## snape 14.002346 5.895725 0.7369656 3.684828 0.000000 0.7369656
## umbridge 123.214085 22.631159 30.1748781 37.718598 0.000000 0.0000000
## vernon 0.000000 0.000000 0.0000000 0.000000 0.000000 15.0874390
## Docs
## Terms 189 194 196 81
## — 0.000000 0.000000 0.000000 0.000000
## ¨c 38.950445 165.539392 177.711406 0.000000
## dobby 2.689660 0.000000 0.000000 59.172517
## hagrid 0.724993 5.799944 0.000000 10.874894
## lupin 0.000000 1.415037 8.490225 0.000000
## sirius 0.000000 0.000000 7.101497 2.028999
## slughorn 0.000000 0.000000 0.000000 0.000000
## snape 0.000000 1.473931 78.855319 5.158759
## umbridge 0.000000 0.000000 0.000000 0.000000
## vernon 0.000000 0.000000 0.000000 2.514573
preparation work (cont.)
hpConvert <- as.data.frame(as.matrix(hpTDM))
hpTibble = as_tibble(hpConvert, .name_repair = "universal")
## New names:
## * `1` -> ...1
## * `2` -> ...2
## * `3` -> ...3
## * `4` -> ...4
## * `5` -> ...5
## * … and 195 more problems
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.
## Warning: Setting row names on a tibble is deprecated.
hpNMF <- nmf(hpTibble, 4, seed = 1001)
save(hpNMF, file = "/Users/katliyx/Documents/SPRING2020/ITAO70250-Unstructured/Homework/HW3/nmfOut.Rata")
load("/Users/katliyx/Documents/SPRING2020/ITAO70250-Unstructured/Homework/HW3/nmfOut.Rata")
wMatrix = as.data.frame(basis(hpNMF))
head(wMatrix[order(-wMatrix$V1), ], 5)
## V1 V2 V3 V4
## umbridge 20234.721 102.8372 2.220446e-16 5.849498e+01
## professor 11051.292 103.7483 1.337324e+03 2.362579e+03
## lupin 10190.473 4246.1728 2.127880e+03 2.220446e-16
## snape 10047.151 3759.9352 7.032213e+00 3.092425e+03
## hagrid 8475.255 705.3663 3.947349e+03 5.777697e+03
## V1 V2 V3 V4
## ¨c 2.220446e-16 42063.296 5.089918e-11 5.691338e-10
## griphook 1.018192e+01 8475.596 8.811202e+02 2.836854e+00
## eater 6.882399e+02 8426.354 2.003849e+02 1.257583e+03
## voldemort 6.814381e+02 7182.504 9.542275e+02 1.380955e+03
## kreacher 6.581217e+02 7055.402 3.277940e+03 1.351365e+03
## V1 V2 V3 V4
## vernon 2.220446e-16 8.319926e+01 16320.80 9.286498e-09
## uncle 8.879144e+01 8.458021e-13 13430.88 2.047613e+02
## dudley 4.120097e+01 1.197749e+01 12426.54 1.550563e-10
## petunia 2.673152e+01 4.371715e+02 10587.57 1.654380e-04
## mrs 1.514712e+02 1.543847e+03 10455.68 1.166612e+03
## V1 V2 V3 V4
## — 2.220446e-16 3.722575e-15 2.220446e-16 33665.324
## slughorn 3.707542e+01 3.896770e+02 8.329534e-12 15004.200
## bagman 2.602750e+01 3.562215e+00 4.700756e+01 10867.256
## winky 5.316458e+01 3.884784e+00 1.158944e+01 9951.494
## moody 2.805123e+02 2.878832e+02 3.147092e+03 9827.710
preparation work (cont.)
set.seed(1001)
holdoutRows <- sample(1:nrow(hp7), 100, replace = FALSE)
hp_text <- textProcessor(documents = hp7$text[-c(holdoutRows)],
metadata = hp7[-c(holdoutRows), ],
stem = FALSE)
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Removing numbers...
## Creating Output...
preparation work (cont.)
# prepare the processed text for stm; this step defines hp_prep used below
hp_prep <- prepDocuments(documents = hp_text$documents, vocab = hp_text$vocab, meta = hp_text$meta)
## Removing 11663 of 22659 terms (11663 of 136872 tokens) due to frequency
## Your corpus now has 100 documents, 10996 terms and 125209 tokens.
To determine the best number of topics, focus on semantic coherence and the residuals. Semantic coherence measures how well the words in a topic hang together - computed from a conditional probability score over frequent words - and is essentially a measure of human interpretability. Thus, low residuals and high semantic coherence are preferred. In this case, the residuals take a sharp dive as k increases to 10; thus 10 is chosen as k.
ktest = searchK(documents = hp_prep$documents,
vocab = hp_prep$vocab,
K = c(3, 4, 5, 10, 20), verbose = FALSE)
plot(ktest)
From the plot down below, we could see that the topic with the highest expected proportion is topic 2, which contains words like "harry", "ron", and "said" (of course - a better approach might be to eliminate all words related to the main characters, the Hogwarts Houses, etc. in the first place). The topic expected to cover the second-largest proportion is topic 5, which again features "said", "harry", and "ron". Moving through all ten topics, every one contains harry, along with other main characters' names like hermione and dumbledore!
topics10 <- stm(documents = hp_prep$documents,
vocab = hp_prep$vocab, seed = 1001,
K = 10, verbose = FALSE)
plot(topics10)
Now, see what emerges from the topics in detail. FREX words are probably what we should focus on here, since they occur frequently within a topic and are exclusive to it. In topic 2, the FREX words are umbridge, parvati, cho, angelina, potions, katie, dobby. The highest-probability words are also worth attention, since these are the words most likely to occur within a topic; under topic 2, they are said, harry, ron, hermione, just, well, professor. All fairly foreseeable.
## Topic 1 Top Words:
## Highest Prob: harry, said, uncle, vernon, aunt, dudley, petunia
## FREX: marge, dudley, aunt, vernon, uncle, petunia, dudleys
## Lift: brutuss, horsy, mentally, pixies, sightless, venemous, waistband
## Score: marge, vernon, dudleys, petunia, figg, uncle, dursley
## Topic 2 Top Words:
## Highest Prob: said, harry, ron, hermione, just, well, professor
## FREX: umbridge, parvati, cho, angelina, potions, katie, dobby
## Lift: bezoar, hearthrug, losers, -one, accurate, afternoons, applied
## Score: katie, angelina, parvati, umbridge, kreacher, slughorn, luna
## Topic 3 Top Words:
## Highest Prob: harry, said, moody, ron, around, back, cedric
## FREX: krum, cedric, champions, moody, task, moodys, tournament
## Lift: -large, acceptance, bubbly, diamond, disqualified, fiber, flicker
## Score: moody, champions, krum, bagman, merpeople, karkaroff, enclosure
## Topic 4 Top Words:
## Highest Prob: said, harry, dumbledore, snape, now, back, looked
## FREX: fudge, dumbledore, slughorn, pensieve, memory, snape, morfin
## Lift: arabella, augustus, bites, blades, blaming, characteristic, cheerfulness
## Score: slughorn, morfin, fudge, snape, hepzibah, crouch, figg
## Topic 5 Top Words:
## Highest Prob: harry, said, ron, hermione, professor, back, lupin
## FREX: scabbers, wood, crookshanks, gryffindor, pettigrew, lupin, team
## Lift: axe, execution, flobberworms, malfay, nooo, prongs, sabotage
## Score: scabbers, hooch, lupin, flint, pettigrew, rat, blacks
## Topic 6 Top Words:
## Highest Prob: said, harry, weasley, ron, mrs, hermione, fred
## FREX: scrimgeour, weasley, mrs, weasleys, arthur, percy, charlie
## Lift: -heat, -tray, accommodate, accuse, adoringly, ancestors, announcement
## Score: scrimgeour, tents, roberts, kreacher, bagman, crouch, charlie
## Topic 7 Top Words:
## Highest Prob: harry, said, hermione, ron, dumbledore, know, wand
## FREX: xenophilius, hallows, grindelwald, deathly, godrics, lily, kreacher
## Lift: deathstick, erumpent, gellert, hallows, ¨cit, ¨cof, againso
## Score: xenophilius, hallows, grindelwald, kreacher, griphook, deathly, regulus
## Topic 8 Top Words:
## Highest Prob: harry, said, hagrid, hermione, ron, back, yeh
## FREX: yeh, ter, yer, hagrid, don, rita, fer
## Lift: bidin, breeds, couldn, crinkled, croakily, crocodile-skin, determinedly
## Score: hagrid, yeh, yer, bagman, rita, ollivander, ter
## Topic 9 Top Words:
## Highest Prob: harry, said, wand, hermione, ron, back, around
## FREX: bellatrix, greyback, prophecy, cattermole, diary, travers, riddle
## Lift: ¨cstop, arch, battling, bawled, betrayal, black-hooded, blistered
## Score: griphook, travers, greyback, bellatrix, bellatrixs, dais, cattermole
## Topic 10 Top Words:
## Highest Prob: harry, said, back, hagrid, ron, hermione, looked
## FREX: diadem, tonks, nick, kingsley, motorbike, mad-eye, crabbe
## Lift: better-looking, carrows, dedaluss, dizzying, hagridharry, motorbike, whereve
## Score: diadem, motorbike, kingsley, tonks, luna, dedalus, bike
Now, see which text from the books has the highest probability of being associated with each topic. The result of this step turns out to be a LONG excerpt from the books (apparently), so the output is hidden here.
Next, since topic models are probabilistic, compute the probabilities of each document belonging to each topic.
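As a toy illustration of what these probabilities are (the counts below are hypothetical, purely to show the normalization), each document's topic scores form a row that sums to 1:

```r
# Hypothetical document-by-topic scores: two documents, three topics
doc_topic <- matrix(c(8, 1, 1,
                      2, 5, 3), nrow = 2, byrow = TRUE)
theta <- doc_topic / rowSums(doc_topic)  # normalize each row into a probability distribution
rowSums(theta)  # each row sums to 1
```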
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.2730288220 4.266334e-04 3.724610e-04 2.707508e-05 0.0008390219
## [2,] 0.0001049162 1.810393e-04 1.498009e-04 8.360836e-06 0.0002413514
## [3,] 0.0094009254 2.739393e-04 2.218427e-04 4.836820e-04 0.7071116815
## [4,] 0.0002889296 1.472327e-04 2.210055e-05 9.385872e-05 0.9986132588
## [5,] 0.0001760840 1.484458e-04 2.218044e-05 9.395180e-05 0.9987215784
## [6,] 0.0015030774 7.539111e-04 8.299145e-04 2.074743e-04 0.1622878153
## [7,] 0.0042489279 2.258018e-04 2.886410e-04 1.699895e-04 0.0006834611
## [8,] 0.6935701057 4.590232e-04 1.434213e-04 1.367515e-04 0.3037400590
## [9,] 0.0004164102 6.869285e-04 2.069677e-04 6.839662e-05 0.4600250960
## [10,] 0.0001110533 9.973328e-01 2.689690e-04 8.800624e-04 0.0005345626
## [11,] 0.0008652314 5.908224e-01 6.753583e-04 3.614958e-03 0.2198512988
## [12,] 0.0005260726 1.712185e-03 4.997754e-01 5.264215e-04 0.0018317711
## [13,] 0.0042321812 1.333026e-03 3.851515e-04 2.028674e-01 0.2160993806
## [14,] 0.9988696781 1.063736e-04 1.211603e-04 5.682283e-05 0.0003356242
## [15,] 0.0096753282 8.189861e-05 4.409979e-05 3.660200e-04 0.1658359967
## [,6] [,7] [,8] [,9] [,10]
## [1,] 0.0003900969 4.134061e-05 0.7246271312 4.116701e-06 2.433012e-04
## [2,] 0.0001773218 1.442521e-05 0.9990761915 4.859815e-07 4.610687e-05
## [3,] 0.0005136848 2.802837e-01 0.0009092437 4.088504e-04 3.924747e-04
## [4,] 0.0001080784 1.070262e-04 0.0003396418 2.201595e-04 5.971372e-05
## [5,] 0.0001018846 1.230228e-04 0.0003491499 2.168536e-04 4.684855e-05
## [6,] 0.0361072470 2.726899e-04 0.7976254325 1.193374e-04 2.931006e-04
## [7,] 0.8318028442 1.506552e-04 0.1621862564 5.096658e-05 1.924564e-04
## [8,] 0.0002922679 1.496150e-04 0.0010415188 2.707309e-04 1.965067e-04
## [9,] 0.0005444883 1.207367e-04 0.5377455360 3.002840e-05 1.554116e-04
## [10,] 0.0000368077 5.040847e-05 0.0004952048 2.087689e-04 8.135646e-05
## [11,] 0.1309241545 4.100243e-04 0.0465740086 5.036496e-03 1.226048e-03
## [12,] 0.0011369750 5.922394e-04 0.0004289024 4.928849e-01 5.851253e-04
## [13,] 0.0017248717 3.829752e-04 0.0001493163 5.722928e-01 5.328308e-04
## [14,] 0.0001308991 4.428319e-05 0.0001729407 5.780201e-05 1.044160e-04
## [15,] 0.8219002720 3.121304e-04 0.0013000300 3.133676e-04 1.708567e-04
Image Detection
To see how image detection works on some of the most famous scenes from the Harry Potter movies, I picked the following three pictures to experiment on.
Buckbeak’s Flight
–from Harry Potter and the Prisoner of Azkaban
Lakeside, seashore, and volcano are among the highest-probability labels – not bad!
library(image.darknet)
model <- system.file(package = "image.darknet", "include", "darknet", "cfg", "tiny.cfg")
weights <- system.file(package = "image.darknet", "models", "tiny.weights")
f <- system.file(package="image.darknet", "include", "darknet", "data", "imagenet.shortnames.list")
labels <- readLines(f)
darknet_tiny <- image_darknet_model(type = 'classify',
model = model, weights = weights, labels = labels)
x1 <- image_darknet_classify(file = "/Users/katliyx/Documents/SPRING2020/ITAO70250-Unstructured/Homework/HW3/hp1.jpg", object = darknet_tiny)
x1
## $file
## [1] "/Users/katliyx/Documents/SPRING2020/ITAO70250-Unstructured/Homework/HW3/hp1.jpg"
##
## $type
## label probability
## 1 lakeside 0.16771728
## 2 seashore 0.13642623
## 3 sandbar 0.11876366
## 4 promontory 0.09261914
## 5 volcano 0.04462162
With the detection threshold set to 0.1, it works out pretty well!
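The threshold is just a probability cut-off on the model's scores; a quick sketch using the five classification probabilities printed above:

```r
# The five labels and probabilities from the classification output above
probs <- c(lakeside = 0.16771728, seashore = 0.13642623, sandbar = 0.11876366,
           promontory = 0.09261914, volcano = 0.04462162)
names(probs[probs > 0.1])  # only labels clearing the 0.1 threshold survive
```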
yolo_tiny_voc <- image_darknet_model(type = 'detect',
model = "tiny-yolo-voc.cfg",
weights = system.file(package="image.darknet",
"models", "tiny-yolo-voc.weights"),
labels = system.file(package="image.darknet", "include",
"darknet", "data", "voc.names"))
y1 <- image_darknet_detect(file = "/Users/katliyx/Documents/SPRING2020/ITAO70250-Unstructured/Homework/HW3/hp1.jpg",
object = yolo_tiny_voc,
threshold = 0.1)
Voldemort Vs. Dumbledore
– from Harry Potter and the Order of the Phoenix
To be fair, those sparkles really look like a barn spider, caldron, fountain, or volcano. And a space shuttle… probably not that wrong in this case.
x2 <- image_darknet_classify(file = "/Users/katliyx/Documents/SPRING2020/ITAO70250-Unstructured/Homework/HW3/hp2.jpg", object = darknet_tiny)
x2
## $file
## [1] "/Users/katliyx/Documents/SPRING2020/ITAO70250-Unstructured/Homework/HW3/hp2.jpg"
##
## $type
## label probability
## 1 barn spider 0.06893399
## 2 caldron 0.04882356
## 3 fountain 0.04150808
## 4 volcano 0.03183554
## 5 space shuttle 0.03058341
I tried setting the threshold to several values, but the results did not turn out well. Here is the prediction picture with the threshold set to 0.07. Since there is definitely neither a cow nor a car in this scene, could this have something to do with the dim lighting?
yolo_tiny_voc <- image_darknet_model(type = 'detect',
model = "tiny-yolo-voc.cfg",
weights = system.file(package="image.darknet",
"models", "tiny-yolo-voc.weights"),
labels = system.file(package="image.darknet", "include",
"darknet", "data", "voc.names"))
y1 <- image_darknet_detect(file = "/Users/katliyx/Documents/SPRING2020/ITAO70250-Unstructured/Homework/HW3/hp2.jpg",
object = yolo_tiny_voc,
threshold = 0.07)
The Patronus
– from Harry Potter and the Prisoner of Azkaban
The image detection is not working well here either; the reasons remain to be discovered.
x3 <- image_darknet_classify(file = "/Users/katliyx/Documents/SPRING2020/ITAO70250-Unstructured/Homework/HW3/hp3.jpg", object = darknet_tiny)
x3
## $file
## [1] "/Users/katliyx/Documents/SPRING2020/ITAO70250-Unstructured/Homework/HW3/hp3.jpg"
##
## $type
## label probability
## 1 ice bear 0.03816724
## 2 killer whale 0.02890317
## 3 valley 0.02306473
## 4 electric ray 0.02027919
## 5 tiger shark 0.01995512
Again, detection on this image does not work well, and again it probably has something to do with the low light and dark tones. Below is the prediction image with the threshold set to 0.15.
yolo_tiny_voc <- image_darknet_model(type = 'detect',
model = "tiny-yolo-voc.cfg",
weights = system.file(package="image.darknet",
"models", "tiny-yolo-voc.weights"),
labels = system.file(package="image.darknet", "include",
"darknet", "data", "voc.names"))
y1 <- image_darknet_detect(file = "/Users/katliyx/Documents/SPRING2020/ITAO70250-Unstructured/Homework/HW3/hp3.jpg",
object = yolo_tiny_voc,
threshold = 0.15)
Takeaway & Looking Forward
While sentiment analysis on single words might appear useful in some scenarios, that is not likely to be the case for literature, at least not for the Harry Potter series. As the analysis proceeded, we saw that under single-word sentiment analysis, negativity basically prevails in all seven books. However, once negation words are accounted for by analyzing bigrams, it becomes clear that single-word sentiment analysis can be misleading. Sentiment analysis on whole sentences makes this flaw even more pronounced.
Also, comparing the results based on the AFINN, Bing, and NRC lexicons shows that they are not necessarily identical to each other. The differences are rooted in the logic behind each lexicon.
Thus, the accuracy of sentiment analysis depends largely on the methodology chosen; everything should probably be done with context in mind. Further investigation into the relationship between analysis of unigrams, larger units of text, and whole sentences is needed.
As for topic modeling and image detection - I've had fun, and I want to explore how image detection technology performs on pictures with darker tones and lower light.
Lastly, since I created my own stop_words dictionary to eliminate expectedly frequent words like the main characters' names, the Hogwarts House names, etc., I want to find better ways to also eliminate terms like, for example, "harry's".
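One possible direction (a sketch; the three-name stop list here is a hypothetical subset of the full dictionary) is a regex that matches each name together with an optional possessive suffix:

```r
# Hypothetical subset of the custom stop list
custom_stops <- c("harry", "hermione", "gryffindor")
# \\b(...)('s)?\\b matches each name, optionally followed by 's
pattern <- paste0("\\b(", paste(custom_stops, collapse = "|"), ")('s)?\\b")
cleaned <- gsub(pattern, "", "Harry's wand lay in Gryffindor tower", ignore.case = TRUE)
cleaned  # names and their possessives are gone; squish the leftover whitespace afterwards
```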
To be continued.